ABSTRACT

The data stored within data centers often arrives in a state that is not immediately conducive to 
experimental endeavors. To harness its full potential, data must undergo a refinement process, 
transforming it into a format that computer systems can readily comprehend and utilize to execute 
the required actions. This paper focuses on the preprocessing of Manipuri Synset and Manipuri 
Corpus data, sourced from the TDIL data center, along with electronic dictionary data. The 
preprocessing tasks encompass the conversion of non-Unicode data to Unicode, spelling correction, 
tokenization for text segmentation, removal of stop words and stemming to reduce words to their 
root form. The primary objective of this paper is to prepare data for immediate use in word sense 
disambiguation for the Manipuri language using the Meitei/Meetei mayek script. This is pivotal for 
precise language understanding and semantic interpretation. Importantly, the processed data extends 
its utility beyond word sense disambiguation. It can be applied in various natural language processing 
(NLP) research areas including Machine Translation, Information Retrieval and Question 
Answering, where language comprehension is paramount. These preprocessing efforts offer a 
versatile tool for advancing language technology and facilitating in-depth research in the field of 
NLP. 
1. INTRODUCTION

Meiteilon, also known as the Manipuri language, is a Tibeto-Burman language spoken primarily in the
Indian state of Manipur and in some neighbouring regions of India and Myanmar. It has its own script,
Meitei Mayek, which has been in use for many years. The language has a complex grammatical
structure, and its verbs are inflected for tense, mood, aspect and person. Meiteilon has been
influenced by Sanskrit, Assamese and Bengali and has borrowed many words from these languages.
Details of the Meitei Mayek script are shown in the following tables[14,15].


Table 1(a): Meitei Mayek 27 consonants 


Table 1(b): Meitei Mayek 8 half consonants 


Table 1(c): Meitei Mayek 8 vowel signs (Cheitap Iyek), Lum Iyek and Apun Iyek


Table 1(d): Meitei Mayek 10 numerals 




A word may carry different senses depending on the context, area or topic in which it is used.
Words with the same spelling can therefore express different situations or meanings depending on
where they are used. This creates a serious problem of ambiguity for a layperson trying to
categorize words. Human language itself contains a great deal of ambiguity, since the same word can
be interpreted in different ways, and this ambiguity causes many problems when a statement is
expressed. Word Sense Disambiguation (WSD) provides an effective mechanism for resolving these
problems. Ambiguity affects not only human readers but also machine translation, where it remains
an open problem that many researchers are trying to solve[7]. What sets language processing apart
is its reliance on knowledge of language[6]. WSD is a sub-branch of Natural Language Processing
(NLP) that determines which meaning of a word is activated by its use in a particular context. In
other words, it is the process of identifying the correct sense of a word, from a set of possible
senses, based on the context in which the word appears.

The structure of the paper is as follows: Section 2 provides a detailed examination of the work, 
including insights and preprocessing procedures. Section 3 is dedicated to the discussion and Section 
4 offers the paper's concluding remarks. 


2. PROPOSED METHODOLOGY 

In this paper, we describe how the Manipuri data related to WSD were collected, some of which are
directly machine readable and some of which are not. All of these data are converted into a
machine-readable Meitei/Meetei Mayek format, and the preprocessing tasks are then performed on the
converted data. The details are given in the following subsections.


2.1 DATA COLLECTION 

Manipuri is a language with very limited electronic data, especially in the Meitei Mayek script.
The data were collected from TDIL through the proper channel. The collected data cover many fields,
namely arts, science, literature and media. The arts data cover economics, history, law,
linguistics, philosophy, politics, psychology, religion and sociology, while the science data cover
biology, botany, chemistry, geography, mathematics, medicine, physics, wildlife, zoology and other
related subjects. The literature data cover arts and crafts, criticism, culture, didactic writing,
novels, short fiction, theatre and trivia. Lastly, the media data cover magazines and newspapers.


Table 2: Description of data collection 


Name of the resource                 | Script       | Format      | Remark
Manipuri Synset (IndoWordNet data)   | Meitei Mayek | Non-Unicode | Needs to be converted to Unicode format
Manipuri Corpus (Monolingual corpus) | Bengali      | Unicode     | Needs to be converted to Meitei Mayek Unicode format
Electronic Dictionary                | Bengali      | Unicode     | Needs to be converted to Meitei Mayek Unicode format
2.2 TASKS IN PREPROCESSING


2.2.1 NON-UNICODE TO UNICODE CONVERSION

Data are available in various formats, depending on what suits the data. For machine processing,
however, data in Unicode format work best. Data in Unicode format are available for only a few
Indian regional languages. Manipuri is one of the Indian regional languages with limited
electronic data, and most of the available data are either in a non-Unicode format or in Bengali
script. For Manipuri data to be processed by machine, the non-Unicode Manipuri data must therefore
be converted into Unicode[15]. Initially, 20% of the data were converted into Unicode format by
manually typing them or by extracting them from the TDIL IndoWordNet website. The non-Unicode
data are then converted programmatically, by mapping the non-Unicode character code of each
character of the Manipuri language (Meitei Mayek script) to its Unicode character code, and the
result is stored in a machine-readable file format.
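
The character-level mapping described above can be sketched as follows. This is a minimal
illustration and not the conversion program used in this work: the table LEGACY_TO_UNICODE and the
function convert_to_unicode are hypothetical names, and the five entries are only guessed from the
first word of the sample in Table 4 ("mnuQd", transliterated as "Manungda"). A real legacy font
encoding would need a complete table and possibly reordering rules.

```python
# Hypothetical excerpt of a legacy-font-to-Unicode table (assumption: the
# real table covers every character code of the non-Unicode font).
LEGACY_TO_UNICODE = {
    "m": "\uabc3",  # MEETEI MAYEK LETTER MIT
    "n": "\uabc5",  # MEETEI MAYEK LETTER NA
    "d": "\uabd7",  # MEETEI MAYEK LETTER DIL
    "Q": "\uabe1",  # MEETEI MAYEK LETTER NGOU LONSUM
    "u": "\uabe8",  # MEETEI MAYEK VOWEL SIGN UNAP
}


def convert_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map each legacy character to its Unicode counterpart.

    Returns the converted text and the set of non-space characters that
    had no mapping, so that they can be reviewed by hand.
    """
    converted = []
    unmapped = set()
    for ch in text:
        if ch in table:
            converted.append(table[ch])
        else:
            converted.append(ch)  # keep unknown characters unchanged
            if not ch.isspace():
                unmapped.add(ch)
    return "".join(converted), unmapped


unicode_word, missing = convert_to_unicode("mnuQd")
```

Returning the unmapped characters alongside the converted text is a design choice of this sketch:
it makes gaps in the table visible instead of silently passing legacy codes through.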


Table 3: List of Meitei Mayek Unicode values 


Decimal value | Hexadecimal value
43979         | ABCB
The main advantage of keeping the Meitei/Meetei Mayek data in Unicode format is that any Unicode
data can be processed by machine for any kind of research work. It also allows one file format to
be converted into another with minimal effort. For instance, a .txt file can easily be converted
to .xlsx, .rtf or .csv.
Table 4: Unicode Converted data


Sample of non-Unicode data | Unicode converted sample data (transliteration)
mnuQd czlCtr_ib atiTisiQ adu yAMn TunmC mnuQd czhLIClo | Manungda changlaktriba atithising adu yamna thunamak manungda changhanlako
aYKoln yuMdgi heC TorCpg KzhOdn noQ cuTrCIMmi | Eikhoina yumdage hek thorakpaga khanghoudana nong chutharaklammi
akib UxtuN puCniQ soNThNb | Akiba uttuna pukning sonthahanba


The Manipuri Synset data that were not in Unicode format were successfully converted using the
above method. The same method can be used to convert any other non-Unicode data into
machine-readable Unicode format.


2.2.2 SPELLING CORRECTION 

Data entry is done either automatically or manually. Automatically entered data may be
misinterpreted by the machine in special cases, such as unusual grammatical rules or unsupported
special characters and symbols, while manually entered data may contain typing mistakes. It is
therefore necessary to recheck the collected data for correctness. In this work, the Unicode data
produced by the preceding conversion step contain spelling mistakes. In some words the “atap”
mayek occurs consecutively, which is not allowed in the Manipuri language (in Meitei Mayek
script). This happens because the rule used to represent “atap” mayeks in the non-Unicode format
differs from the way they are represented in Unicode. The converted Manipuri Synset and Manipuri
Corpus data contain these mistakes. To rectify them, every word is checked manually by human
experts, who check not only the spelling of each word but also whether the word is appropriate
within its sentence. During this phase, some wrong entries of the “apun” mayek and “ba” were also
encountered and manually corrected.

This preprocessing step is very time consuming, yet it is needed to avoid misleading word or
sentence meanings, which play a vital role in NLP applications such as Word Sense Disambiguation,
Question Answering and Machine Translation.
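
Candidate words for the manual check could be pre-selected automatically. The sketch below flags
every token that contains two or more consecutive “atap” signs. It assumes that the “atap” mayek
corresponds to the vowel sign encoded at U+ABE5 (MEETEI MAYEK VOWEL SIGN ANAP), which the text does
not state explicitly; the function name and the sample tokens are illustrative only, and the final
decision on each flagged word would remain with the human experts.

```python
import re

# Assumption: the "atap" mayek is the vowel sign at code point U+ABE5.
ATAP = "\uabe5"
REPEATED_ATAP = re.compile(ATAP + "{2,}")


def flag_repeated_atap(tokens):
    """Return the tokens that contain two or more consecutive "atap" signs."""
    return [tok for tok in tokens if REPEATED_ATAP.search(tok)]


# Illustrative tokens: only the second one carries a doubled sign.
sample = ["\uabc3\uabe5", "\uabc3\uabe5\uabe5", "\uabc5\uabe8"]
suspects = flag_repeated_atap(sample)
```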


2.2.3 TOKENIZATION

Word Sense Disambiguation can be performed at word level or sentence level[7]. Hence, the whole
content of the corpus needs to be broken down into smaller units. The process of breaking the
corpus down into individual smaller units is termed tokenization[4], and the units obtained are
termed tokens. Tokens can be words, subwords or characters, depending on the granularity of the
tokenization approach. Tokenization can be performed using various delimiters such as spaces,
commas (,), full stops (.) and line breaks[11,12].

Tokenization can be performed at sentence level or word level. Sentence-level tokenization splits
the given content into individual sentences, while word-level tokenization splits the content of a
corpus into individual words[12]. In this paper, tokenization is performed using spaces as the
delimiter. The Manipuri Synset and Manipuri Corpus data are broken down into individual words;
hence word-level tokenization is performed in this paper.
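
Space-delimited word-level tokenization can be sketched in a few lines. The example is only an
illustration: the function names are hypothetical, and it runs on the romanized transliterations
from Table 4 rather than on Meitei Mayek text, although the same calls work unchanged on Unicode
strings.

```python
def word_tokenize(text):
    """Split a text into word tokens using whitespace as the only delimiter."""
    # str.split() without arguments splits on any run of whitespace, so
    # repeated spaces and line breaks do not produce empty tokens.
    return text.split()


def tokenize_corpus(lines):
    """Tokenize every line of a corpus and return one flat list of tokens."""
    tokens = []
    for line in lines:
        tokens.extend(word_tokenize(line))
    return tokens


corpus = ["Akiba uttuna pukning sonthahanba", "  Eikhoina yumdage  hek  "]
tokens = tokenize_corpus(corpus)
```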
Sample output of the tokenized data:

(RYO, UCT, MSY, TRAY, ‘7S, Uk’, KA, ROM’, SOC’, EER, THAT, SNR, 
WEF, MACTKRIYC', ULE, WH, TROD, ‘ROM, REIR, 'UR,, UHH, TROT, 
MACK", 'R RR’) 

The general benefits of tokenization include vocabulary creation, text preprocessing, contextual
understanding, data size reduction, feature extraction, standardization and normalization. It
also helps in understanding context by capturing nuances, disambiguating word meanings and
uncovering patterns within the text[2].


2.2.4 STOP WORDS REMOVAL 
Words in text data with context-independent meanings are termed “stop words”. Their inclusion 
during processing adds to time and storage demands. Removing these stop words overcomes 
these limitations. The following 40 stop words are considered in this paper. 

Table 5: List of Manipuri stop words 


Stop words (transliteration):

(Ei)           | (Mahakki)  | (Mahak)   | (Mayokta)
(Eikhoige)     | (Makhoige) | (Asige)   | (Kanabu)
(Eikhoisingge) |            |           |
(Mamangda)     | (Houjik)   | (Makhoi)  | (Adu)
(Amasung)      | (Loinana)  | (Mahak)   | (Tousi)
(Maramdi)      | (Eikhoi)   | (Ngamba)  | (Esage)


Consider the following sentence: 
EIE? ERIET rei RATE CUR Germ Umr É IVE 
(Mige machinjakti mahakki langda thurakpa apikpa tilsingni) 


The result that we obtained after removing the stop words is 
kfrf FRIET TE? TIR Lerm Tm TrM HIO IE? 


(Mige machinjakti mahakki langda thurakpa apikpa tilsingni) 
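
Stop word removal amounts to filtering the token list against the stop word list. The following is
a minimal sketch under stated assumptions: STOP_WORDS holds only a subset of the entries in Table 5,
in romanized lower-case form, the function name is hypothetical, and the input is the
transliteration of the first sample in Table 4 rather than Meitei Mayek text.

```python
# Subset of the stop words in Table 5, romanized and lower-cased (assumption:
# the real list holds all 40 entries in Meitei Mayek script).
STOP_WORDS = {"ei", "mahak", "mahakki", "makhoi", "eikhoi", "adu", "amasung", "houjik"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


sentence = "Manungda changlaktriba atithising adu yamna thunamak manungda changhanlako"
filtered = remove_stop_words(sentence.split())
```

Storing the stop words in a set keeps each membership test constant-time, which matters when the
filter is applied to every token of a corpus.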


2.2.5 STEMMING 

Stemming is the process of reducing a given word to its root form. Unlike lemmatization, the
reduced word may or may not be meaningful. The main use of stemming is to improve the accuracy of
NLP tasks.

There are several ways of performing stemming. Popular stemming algorithms include the Porter
stemmer, the Lancaster stemmer and the Snowball stemmer. For the Manipuri language, S. Poireiton
et al.[13] use a suffix-stripping mechanism to reduce Manipuri words in Bengali script to their
root forms. The main disadvantages are understemming and overstemming of words. Various sources
report different numbers of suffixes and prefixes[9,13]. The following 91 suffixes and 12 prefixes
can be considered for stemming.
Table 6: List of Manipuri Suffixes
(Gidana)    | (Dubuna)     | (Gidamaktne) | (Sina)     | (Dubude) | (Sidage)   | (Dudage)
(Gidade)    | (Dana)       | (Gidade)     | (Duna)     | (Gi)     | (Dagede)   | (Sigeda)
(Sigedade)  | (Sidana)     | (Sigede)     | (Nade)     | (Duge)   | (Sidagede) | (Gida)
(Dugidadi)  | (Dudana)     | (Dugede)     | (Sinade)   | (Side)   | (Dugedade) | (Dunade)
(Dugiga)    | (Dagena)     | (Douna)      | (Dunade)   | (Sige)   | (Ga)       | (Na)
(Gigade)    | (Sidagena)   | (Su)         | (Sinabude) | (Gide)   | (Siga)     | (Sina)
(Gina)      | (Dudagena)   | (Gisu)       | (Dine)     | (Dage)   | (Duga)     | (Duna)
( igina)    | (Sigedamak)  | (Gidasu)     | (Sidene)   | (Da)     | (Gede)     | (Nade)
(Dugina)    | (Gidamak)    | (Busu)       | (Dudene)   | (Sida)   | (Sigade)   | (Sinadi)
(Ginade)    | (Dugeda)     | (Dounade)    | (Bu)       | (Duda)   | (Dugedi)   | (Sigenade)
(Dugidamak) | (Dhounabu)   | (Sibu)       | (Dade)     | (Gina)   | (Dugenade) | (Gi)
(Sidouna)   | (Dubu)       | (Sidade)     | (Sigena)   | (Buna)   | (Sige)     | (Sidounabu)
(Bude)      | (Dudade)     | (Dugena)     | (Sibuna)   | (Duge)   | (Na)       | (Sibude)


Table 7: List of Manipuri Prefixes 
E AE E (S| a 


Sample output: 


MWTETIR to MIETI, FASET to KFS, N ERR to NEF etc. 
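
A longest-match suffix-stripping stemmer can be sketched as below. This is an illustration of the
general technique and not the stemmer of [13] or the one used in this work: SUFFIXES is a small
romanized subset of Table 6, the function name and the min_stem_len guard are assumptions, and the
outputs are simply what the rule produces, not verified root forms.

```python
# Illustrative subset of the suffixes in Table 6, romanized and lower-cased.
SUFFIXES = ["gidamak", "dage", "gida", "sina", "duna", "gi", "da", "na", "bu", "ga", "su"]


def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    # Try longer suffixes first so that "dage" is preferred over shorter matches.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered  # no suffix applies


stems = [stem(w) for w in ["yumdage", "manungda", "da"]]
```

The minimum stem length is one simple guard against overstemming; understemming can still occur
whenever a suffix is missing from the list.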


2.2.6 PREPARATION OF AMBIGUOUS WORDS

Initially, Manipuri ambiguous words are prepared from the IndoWordNet data. Words in IndoWordNet
that have more than one sense are ambiguous words. In this paper, however, only words that have
the same spelling are considered. In addition, ambiguous entries are selected manually by human
experts from the Manipuri Electronic Dictionary.


A word is said to be ambiguous if one of the following conditions is satisfied:
a) If a word in the IndoWordNet has more than one synonym.

b) If a word is used in different parts of speech.

c) If a word has more than one meaning in a dictionary.
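
Conditions (b) and (c) can be checked mechanically once the entries are available as
(word, part-of-speech, sense) records. The sketch below is a hypothetical illustration: the record
layout, the function name and the placeholder words and senses are assumptions, and condition (a)
is not covered.

```python
from collections import defaultdict


def find_ambiguous_words(entries):
    """Return, in sorted order, the words with more than one sense or part of speech.

    entries is an iterable of (word, pos, sense) tuples.
    """
    senses = defaultdict(set)
    tags = defaultdict(set)
    for word, pos, sense in entries:
        senses[word].add(sense)
        tags[word].add(pos)
    return sorted(w for w in senses if len(senses[w]) > 1 or len(tags[w]) > 1)


# Placeholder records; real entries would come from IndoWordNet or the dictionary.
entries = [
    ("word_a", "NOUN", "sense 1"),
    ("word_a", "NOUN", "sense 2"),  # two meanings: condition (c)
    ("word_b", "NOUN", "sense 1"),
    ("word_b", "VERB", "sense 1"),  # two parts of speech: condition (b)
    ("word_c", "NOUN", "sense 1"),  # unambiguous
]
ambiguous = find_ambiguous_words(entries)
```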
Table 8: List of Manipuri ambiguous words
(Khoi) | (Khong-hamba) | (Mi) | (Enba) | (Erei) | (Epa)


2.2.7 SENSE INVENTORY 

For the Manipuri language, e-resources are almost negligible. Development of datasets of printed
Manipuri script has already started, which has become a necessity for such a little-researched
language[8]. The Sense Inventory is a database of Manipuri words that contains the meaning of each
word and an example sentence containing that word. The key principles of sense inventories are
clarity, coherence and comprehensive coverage of the entire spectrum of significant meaning
distinctions[1]. The entries in this database are mainly taken from the Manipuri IndoWordNet data
available at the TDIL data center and from the Manipuri Electronic Dictionary data, both obtained
through the proper channel. Words with a single meaning and words with multiple meanings are
stored together.




The structure of IndoWordNet’s Manipuri data is shown below: 

ID: & 

CAT: ADJECTIVE 

CONCEPT: (Afaba oidaba)

EXAMPLE: (Houdongna lambi lamlanba asi mangol oidaba thoudokni haina lounei)

MANIPURI-SYNSET: (Mangol oidaba douyadaba)

Here, MANIPURI-SYNSET is the word (sometimes together with its synonyms) whose details are
represented in the above format, ID is the unique identification number assigned to a word, CAT
is the part-of-speech of the word, CONCEPT is the meaning/sense of the word and EXAMPLE is a
sentence that contains the word.
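
A record in this "KEY: value" layout can be read into a dictionary as sketched below. The parser is
a hypothetical helper, not part of IndoWordNet; the sample uses the transliterations given above,
and the ID value 1 is a placeholder because the actual ID is not legible in the source.

```python
FIELDS = ("ID", "CAT", "CONCEPT", "EXAMPLE", "MANIPURI-SYNSET")


def parse_record(block):
    """Parse one 'KEY: value' record into a dictionary of known fields."""
    record = {}
    for line in block.splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELDS:
            record[key] = value.strip()
    return record


# Transliterated sample; "ID: 1" is a placeholder value.
sample = """ID: 1
CAT: ADJECTIVE
CONCEPT: Afaba oidaba
EXAMPLE: Houdongna lambi lamlanba asi mangol oidaba thoudokni haina lounei
MANIPURI-SYNSET: Mangol oidaba douyadaba"""
record = parse_record(sample)
```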
The database is stored in the following structure for every word:

a) ID: Identification number of a word.

b) WORD: A word.

c) POS: Associated part-of-speech of the word.

d) Sense1, Sense2, Sense3 etc.: Associated sense(s) of the word.

e) Example Sentence1, Example Sentence2, Example Sentence3 etc.: Associated sentences that
contain the word.
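
One possible in-memory representation of such an entry, flattened into a comma-separated row, is
sketched below. The class and field names are assumptions; the column order follows the header of
Table 9, where each sense carries its own part-of-speech, and all values are placeholders.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class InventoryEntry:
    """One Sense Inventory entry; senses holds (pos, sense, example) triples."""
    word_id: int
    word: str
    senses: list = field(default_factory=list)

    def to_row(self):
        """Flatten the entry into ID, WORD, then POS, SENSE, EXAMPLE per sense."""
        row = [self.word_id, self.word]
        for pos, sense, example in self.senses:
            row.extend([pos, sense, example])
        return row


entry = InventoryEntry(
    1, "word_a",
    [("NOUN", "sense 1", "example 1"), ("VERB", "sense 2", "example 2")],
)
buffer = io.StringIO()
csv.writer(buffer).writerow(entry.to_row())
line = buffer.getvalue().strip()
```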


Table 9: Sample look of the Sense Inventory 


ID, WORD, POS, SENSE1, EXAMPLE SENTENCE1, POS, SENSE2, EXAMPLE SENTENCE2, POS, SENSE3, EXAMPLE SENTENCE3
TRENOUNSTY UR ERIC RO RM LOM ENI CRM C ERE MYON URC CO ORT RAR TRITIM 
SPOCK, SRAM ERIET THI RATE COR LIM TMIN HIVOTT NOUNEN I UU? MUY 
TALRY ZIS UR RAU ROM CON EER TRAU IRA MUE METRIEN UZKI TS TROD 
ROM RUIA UFR, URF CRO MEURT R ARRIT RIU ROM COC EER TRAU IRA 
WUT REURTÉR UZ'RY T'RNOUN MERRY MY OUTRO TOT-ÉO0 Re Tes 


OTÉREA TU RATE ER CRUE RIONI E'H Tf 
The Sense Inventory consists of 16351 words, of which 10185 are nouns, 2024 are verbs, 332 are
adverbs and 3810 are adjectives.

For Word Sense Disambiguation, information related to a word’s senses can be gathered from
WordNet, including synonyms, glosses, example sentences, hypernyms and meronyms, to form a sense
bag. The overlap between the context and each sense bag is then measured using intersection
similarity, and the most probable sense is the one with the maximum overlap. The efficiency of
knowledge-based contextual-overlap WSD algorithms using WordNet/IndoWordNet can be increased by
the use of diverse glosses, longer glosses, proper nouns, enriched synset structures, frequently
used terms and distributional constraints[5].
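
The overlap computation can be illustrated with a simplified, Lesk-style sketch. It is not the
algorithm of [5]: the function names are hypothetical, the sense bags are hand-made English
placeholders rather than IndoWordNet data, and ties are resolved in favour of the sense listed
first.

```python
def overlap_score(context_tokens, sense_bag):
    """Intersection similarity: number of distinct tokens shared by both bags."""
    return len(set(context_tokens) & set(sense_bag))


def disambiguate(context_tokens, sense_bags):
    """Return (sense_id, score) for the sense bag with the maximum overlap."""
    best_sense, best_score = None, -1
    for sense_id, bag in sense_bags.items():
        score = overlap_score(context_tokens, bag)
        if score > best_score:  # strict comparison keeps the first sense on ties
            best_sense, best_score = sense_id, score
    return best_sense, best_score


# Placeholder sense bags for one ambiguous word.
sense_bags = {
    "sense_1": ["river", "water", "shore", "side"],
    "sense_2": ["money", "deposit", "loan", "account"],
}
context = ["she", "opened", "an", "account", "to", "deposit", "money"]
best, score = disambiguate(context, sense_bags)
```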


3. RESULTS AND DISCUSSION

The preprocessed data described above can play a vital role in many Natural Language Processing
tasks such as Machine Translation, Information Retrieval, Word Sense Disambiguation, Semantic
Analysis, Question Answering, Word Sense Induction and Text Classification.

Tokenization is the basic preprocessing step in every NLP application. The tokenized data can be
used to study an individual word or sentence separately. In other words, they will be helpful in
the morphological study of the Manipuri language.

When applied to WSD tasks, tokenization offers advantages in terms of granularity, context
preservation, simplification of feature selection, compatibility with language models, and
evaluation and reproducibility[11,12]. All these benefits pave the way towards solving WSD
problems. The tokenized data obtained in this paper will mainly benefit word-level Word Sense
Disambiguation. Incorporating spelling correction in the WSD process brings advantages such as
better word recognition, wider vocabulary coverage, contextual coherence and robustness to noisy
data.

Stemming can also benefit word sense disambiguation tasks. Its advantages include reduction of
lexical variation, improved coverage and recall, dimensionality reduction and improved
efficiency[14].

However, it is important to note that stemming is a simplification technique that can lead to loss
of information. Stemming may merge different word senses or create false stems that do not
accurately represent the intended meaning[12, 4]. Consequently, stemming should be applied
judiciously and in combination with other techniques to enhance word sense disambiguation
accuracy.

@2023, IJETMS | Impact Factor Value: 5.672
International Journal of Engineering Technology and Management Sciences
Website: ijetms.in Issue: 6 Volume No.7 November - December 2023
DOI: 10.46647/ijetms.2023.v07i06.062 ISSN: 2581-4621

The Unicode conversion program can be very useful for creating a large corpus of Manipuri Meitei
Mayek data, for which electronic data are very limited. Bengali script data, by contrast, are
plentiful, and local newspapers are now a rich source of Manipuri data in Bengali script. The
program can convert these Bengali script Manipuri data into Meitei Mayek script.

The Sense Inventory can also help translate words from other languages into Manipuri correctly.
The same Sense Inventory will act as the sole sense repository for Word Sense Disambiguation of
the Manipuri language. Since this repository contains all the senses of the Manipuri words, strong
and heuristic searches can be performed easily and results can be obtained instantly. The Sense
Inventory can also be a useful aid in Semantic Analysis, as it contains the relationships of a
word with other words. These relationships include hypernymy, hyponymy, meronymy, synonymy and
antonymy of the Manipuri Meitei Mayek words.


4. CONCLUSION AND FUTURE WORK 

Owing to the lack of e-resources for Manipuri in Meitei/Meetei Mayek, a great deal of processing
was carried out on the available data to bring them into a machine-readable format. Further, to
make these machine-readable data usable for WSD, various NLP preprocessing steps such as spelling
correction, tokenization and stop word removal were carried out. These preprocessing tasks were
lengthy and time consuming, yet they cannot be omitted: they are required to yield accurate
results and to build a promising WSD system for the Manipuri language. For research purposes, the
bigger the corpus, the better the performance of the system being developed. Hence, future
researchers can collect more data and add them to this corpus, increasing its size and making it
the gold-standard corpus for the Manipuri language. As this work is the first of its kind for the
Manipuri language, further work can surely be carried out to turn Manipuri from a language with
limited e-resources into one with plentiful data resources. The main benefit of this paper is that
Manipuri NLP researchers can pick up these processed data and use them at will for their desired
tasks.